Create a new pull request by comparing changes across two branches #1657

Merged
merged 11 commits
Jun 13, 2024

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

yaooqinn and others added 11 commits June 12, 2024 20:23
### What changes were proposed in this pull request?

This pull request optimizes the `Hex.hex(num: Long)` method by removing leading zeros, thus eliminating the need to copy the array to remove them afterward.
### Why are the changes needed?

Performance: the change avoids the trailing array copy.

- Unit tests added
- Did a benchmark locally (30~50% speedup)

```
Hex Long Tests:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Legacy                                             1062           1094          16          9.4         106.2       1.0X
New                                                 739            807          26         13.5          73.9       1.4X
```

```scala
object HexBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val N = 10_000_000
    runBenchmark("Hex") {
      val benchmark = new Benchmark("Hex Long Tests", N, 10, output = output)
      val range = 1 to 12
      benchmark.addCase("Legacy") { _ =>
        (1 to N).foreach(x => range.foreach(y => hexLegacy(x - y)))
      }

      benchmark.addCase("New") { _ =>
        (1 to N).foreach(x => range.foreach(y => Hex.hex(x - y)))
      }
      benchmark.run()
    }
  }

  def hexLegacy(num: Long): UTF8String = {
    // Extract the hex digits of num into value[] from right to left
    val value = new Array[Byte](16)
    var numBuf = num
    var len = 0
    do {
      len += 1
      // Hex.hexDigits must be accessible from this scope
      value(value.length - len) = Hex.hexDigits((numBuf & 0xF).toInt)
      numBuf >>>= 4
    } while (numBuf != 0)
    UTF8String.fromBytes(java.util.Arrays.copyOfRange(value, value.length - len, value.length))
  }
}
```
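The idea behind the change can be sketched outside Spark. This Python sketch (a hypothetical re-implementation for illustration, not the actual Scala code) contrasts the legacy approach, which fills a fixed 16-byte scratch buffer and then copies the used suffix, with sizing the buffer exactly up front and skipping the copy:

```python
HEX_DIGITS = b"0123456789ABCDEF"

def hex_legacy(num: int) -> bytes:
    """Fill a fixed 16-byte buffer right-to-left, then copy the used suffix."""
    num &= 0xFFFFFFFFFFFFFFFF  # treat as an unsigned 64-bit value, like >>> in Scala
    value = bytearray(16)
    length = 0
    while True:
        length += 1
        value[16 - length] = HEX_DIGITS[num & 0xF]
        num >>= 4
        if num == 0:
            break
    return bytes(value[16 - length:])  # the extra copy the PR eliminates

def hex_new(num: int) -> bytes:
    """Compute the digit count first, then fill an exactly-sized buffer."""
    num &= 0xFFFFFFFFFFFFFFFF
    length = max(1, (num.bit_length() + 3) // 4)  # hex digits needed, min 1
    value = bytearray(length)
    for i in range(length - 1, -1, -1):
        value[i] = HEX_DIGITS[num & 0xF]
        num >>= 4
    return bytes(value)
```

Both produce identical output; the new variant simply never allocates bytes it will throw away.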

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46952 from yaooqinn/SPARK-48596.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…olumnAlias`

### What changes were proposed in this pull request?
Rename `parent` field to `child` in `ColumnAlias`

### Why are the changes needed?
It should be `child` rather than `parent`, to be consistent with the other expressions in `expressions.py` and with the Scala side.

### Does this PR introduce _any_ user-facing change?
No, it is just an internal change

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46949 from zhengruifeng/minor_column_alias.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…perations

### What changes were proposed in this pull request?
Propagate cached schema in dataframe operations:

- DataFrame.alias
- DataFrame.coalesce
- DataFrame.repartition
- DataFrame.repartitionByRange
- DataFrame.dropDuplicates
- DataFrame.distinct
- DataFrame.filter
- DataFrame.where
- DataFrame.limit
- DataFrame.sort
- DataFrame.sortWithinPartitions
- DataFrame.orderBy
- DataFrame.sample
- DataFrame.hint
- DataFrame.randomSplit
- DataFrame.observe

### Why are the changes needed?
to avoid unnecessary RPCs if possible
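The mechanism can be illustrated with a toy model (hypothetical classes, not the actual `pyspark.sql.connect` implementation): a schema-preserving operation hands its already-fetched schema to the result, so asking the result for its schema needs no extra round trip.

```python
class RemoteDataFrame:
    """Toy stand-in for a Connect DataFrame whose schema lives server-side."""

    def __init__(self, plan, cached_schema=None):
        self.plan = plan
        self._cached_schema = cached_schema
        self.rpc_calls = 0  # how many times this frame had to ask the server

    @property
    def schema(self):
        if self._cached_schema is None:
            self.rpc_calls += 1  # stand-in for a real RPC round trip
            self._cached_schema = ("id", "value")
        return self._cached_schema

    def limit(self, n):
        # limit (like filter, sort, distinct, ...) never changes the schema,
        # so the cached schema is propagated to the new frame
        return RemoteDataFrame(("limit", n, self.plan), self._cached_schema)

df = RemoteDataFrame("scan")
_ = df.schema      # first access: one simulated RPC
child = df.limit(5)
_ = child.schema   # served from the propagated cache: no RPC at all
```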

### Does this PR introduce _any_ user-facing change?
No, optimization only

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46954 from zhengruifeng/py_connect_propagate_schema.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Simplify the if-else branches with `F.lit`, which accepts both Column and non-Column input.

### Why are the changes needed?
code clean up
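The cleanup pattern can be shown with a minimal stand-in for `F.lit` (toy classes for illustration; PySpark's real `lit` likewise returns a Column unchanged and wraps plain values):

```python
class Column:
    def __init__(self, expr):
        self.expr = expr

def lit(value):
    # a Column passes through untouched; anything else is wrapped as a literal
    return value if isinstance(value, Column) else Column(("lit", value))

# before: every call site branches on the input type
def to_column_old(value):
    if isinstance(value, Column):
        return value
    else:
        return Column(("lit", value))

# after: lit already accepts both, so the branch disappears
def to_column_new(value):
    return lit(value)
```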

### Does this PR introduce _any_ user-facing change?
No, internal minor refactor

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46946 from zhengruifeng/column_simplify.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add docs for SPJ

### Why are the changes needed?
There are no docs describing SPJ, even though it is mentioned in the migration notes: #46673

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Checked the new text

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46745 from szehon-ho/doc_spj.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…a function

### What changes were proposed in this pull request?
Fix the string representation of lambda function

### Why are the changes needed?
I happened to hit this bug.

### Does this PR introduce _any_ user-facing change?
yes

before
```
In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x)))
Out[2]: ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
    704 stream = StringIO()
    705 printer = pretty.RepresentationPrinter(stream, self.verbose,
    706     self.max_width, self.newline,
    707     max_seq_length=self.max_seq_length,
    708     singleton_pprinters=self.singleton_printers,
    709     type_pprinters=self.type_printers,
    710     deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
    712 printer.flush()
    713 return stream.getvalue()

File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
    408                         return meth(obj, self, cycle)
    409                 if cls is not object \
    410                         and callable(cls.__dict__.get('__repr__')):
--> 411                     return _repr_pprint(obj, self, cycle)
    413     return _default_pprint(obj, self, cycle)
    414 finally:

File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
    777 """A pprint that just redirects to the normal repr function."""
    778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
    780 lines = output.splitlines()
    781 with p.group():

File ~/Dev/spark/python/pyspark/sql/connect/column.py:441, in Column.__repr__(self)
    440 def __repr__(self) -> str:
--> 441     return "Column<'%s'>" % self._expr.__repr__()

File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:626, in UnresolvedFunction.__repr__(self)
    624     return f"{self._name}(distinct {', '.join([str(arg) for arg in self._args])})"
    625 else:
--> 626     return f"{self._name}({', '.join([str(arg) for arg in self._args])})"

File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:962, in LambdaFunction.__repr__(self)
    961 def __repr__(self) -> str:
--> 962     return f"(LambdaFunction({str(self._function)}, {', '.join(self._arguments)})"

TypeError: sequence item 0: expected str instance, UnresolvedNamedLambdaVariable found
```

after
```
In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x)))
Out[2]: Column<'array_sort(data, LambdaFunction(CASE WHEN or(isNull(x_0), isNull(y_1)) THEN 0 ELSE -(length(y_1), length(x_0)) END, x_0, y_1))'>
```
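The root cause is a plain Python pitfall: `str.join` requires every item to already be a string. A minimal reproduction with a hypothetical variable class (mirroring `UnresolvedNamedLambdaVariable`) shows the bug and the fix of stringifying each argument first:

```python
class LambdaVar:
    """Stand-in for UnresolvedNamedLambdaVariable."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

args = [LambdaVar("x_0"), LambdaVar("y_1")]

# buggy: join refuses non-string items, as in the traceback above
try:
    ", ".join(args)
    raised = False
except TypeError:
    raised = True

# fixed: convert each argument to str before joining
fixed = ", ".join(str(a) for a in args)
```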

### How was this patch tested?
CI, added test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46948 from zhengruifeng/fix_string_rep_lambda.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…commons-io` called in Spark

### What changes were proposed in this pull request?

This PR replaces deprecated classes and methods of `commons-io` called in Spark:

- `writeStringToFile(final File file, final String data)` -> `writeStringToFile(final File file, final String data, final Charset charset)`
- `CountingInputStream` -> `BoundedInputStream`

### Why are the changes needed?

Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed related test cases in `UDFXPathUtilSuite` and `XmlSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46935 from wayneguow/deprecated.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…with spark.sql.binaryOutputStyle

### What changes were proposed in this pull request?

In SPARK-47911, we introduced a universal BinaryFormatter to make binary output consistent
across all clients, such as beeline, spark-sql, and spark-shell, for both primitive and nested binaries.

But unfortunately, `to_csv` and the `csv` writer have interceptors for binary output that are hard-coded to use `SparkStringUtils.getHexString`. This PR makes them configurable as well.

### Why are the changes needed?

feature parity

### Does this PR introduce _any_ user-facing change?

Yes, we have made `spark.sql.binaryOutputStyle` work for CSV, while the existing default behavior is preserved.
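The idea can be illustrated with a hedged sketch (hypothetical function and style names; the actual `spark.sql.binaryOutputStyle` values and plumbing live in Spark's Scala code): a single configured style drives how every sink, CSV included, renders binary values, instead of each writer hard-coding its own representation.

```python
import base64

def format_binary(data: bytes, style: str) -> str:
    """Render binary data according to a configured output style."""
    if style == "HEX":
        return data.hex().upper()
    if style == "BASE64":
        return base64.b64encode(data).decode("ascii")
    if style == "UTF8":
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"unknown binary output style: {style}")

# every writer consults the same switch instead of a hard-coded hex helper
csv_cell = format_binary(b"Spark", "HEX")
```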

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46956 from yaooqinn/SPARK-48602.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR follows up #46938 and improves `unescapePathName`.

### Why are the changes needed?
Improve `unescapePathName` by cutting off the slow path early.
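The fast-path idea can be sketched as follows (a hypothetical Python rendering, not the Scala implementation): a path with no `%` needs no decoding and is returned untouched, so only genuinely escaped paths pay for the character-by-character slow path.

```python
HEX = set("0123456789abcdefABCDEF")

def unescape_path_name(path: str) -> str:
    # fast path: no '%' means nothing to decode; return the input untouched
    first = path.find("%")
    if first < 0:
        return path
    # slow path: decode %xx escapes, keeping malformed sequences verbatim
    out = [path[:first]]
    i, n = first, len(path)
    while i < n:
        if path[i] == "%" and i + 3 <= n and path[i + 1] in HEX and path[i + 2] in HEX:
            out.append(chr(int(path[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(path[i])
            i += 1
    return "".join(out)
```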

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46957 from beliefer/SPARK-48584_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: beliefer <beliefer@163.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `scala-xml` from `2.2.0` to `2.3.0`.

### Why are the changes needed?
The full release notes: https://github.com/scala/scala-xml/releases/tag/v2.3.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46964 from panbingkun/SPARK-48609.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…error class

### What changes were proposed in this pull request?
Track state row validation failures using explicit error class

### Why are the changes needed?
We want to track these exceptions explicitly since they could be indicative of underlying corruptions/data loss scenarios.
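The motivation can be sketched with a toy example (class and error-class names here are hypothetical, not Spark's actual ones): raising a dedicated exception that carries a stable error-class identifier lets monitoring aggregate these failures, instead of matching on free-form message text.

```python
class StateRowValidationError(RuntimeError):
    """Carries a stable, machine-readable error class alongside the message."""

    def __init__(self, error_class: str, message: str):
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class

def validate_value_row(row: bytes, expected_len: int) -> None:
    if len(row) != expected_len:
        # an explicit error class makes the failure trackable downstream
        raise StateRowValidationError(
            "STATE_ROW_FORMAT_VALIDATION_FAILURE",
            f"expected {expected_len} bytes, got {len(row)}",
        )

try:
    validate_value_row(b"abc", 8)
except StateRowValidationError as e:
    caught = e.error_class
```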

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

```
13:06:32.803 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /Users/anish.shrigondekar/spark/spark/target/tmp/spark-6d90d3f3-0f37-48b8-8506-a8cdee3d25d7
[info] Run completed in 9 seconds, 861 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46885 from anishshri-db/task/SPARK-48543.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@GulajavaMinistudio GulajavaMinistudio merged commit ad19577 into GulajavaMinistudio:master Jun 13, 2024
2 of 3 checks passed
8 participants